Principal Component Analysis

PCA (Principal Component Analysis) is a technique that finds the directions of maximum variance in your data, then projects the data onto those directions — reducing dimensions while keeping as much information as possible.

1. Mean centering

\[\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i\]

\[\tilde{\mathbf{x}}_i = \mathbf{x}_i - \bar{\mathbf{x}}\]

2. Covariance matrix

\[\mathbf{C} = \frac{1}{n-1} \sum_{i=1}^{n} \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^\top = \frac{1}{n-1} \tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}\]

where \(\tilde{\mathbf{X}} \in \mathbb{R}^{n \times d}\) is the centered data matrix.
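Steps 1 and 2 can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the synthetic data and the names `X`, `x_bar`, `X_tilde`, and `C` are assumptions chosen to mirror the symbols above.

```python
import numpy as np

# Illustrative synthetic data (an assumption, not from the text): n=200 points, d=3 features
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing

# Step 1: mean centering
x_bar = X.mean(axis=0)        # per-feature mean, shape (d,)
X_tilde = X - x_bar           # centered data matrix, shape (n, d)

# Step 2: sample covariance with the 1/(n-1) normalisation
n = X.shape[0]
C = X_tilde.T @ X_tilde / (n - 1)   # shape (d, d)
```

The result should agree with `np.cov(X, rowvar=False)`, which uses the same `n - 1` denominator by default.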

3. Eigendecomposition

\[\mathbf{C} \mathbf{v}_k = \lambda_k \mathbf{v}_k, \quad k = 1, \dots, d\]

with the eigenpairs ordered such that \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0\) and the eigenvectors chosen to be orthonormal (always possible because \(\mathbf{C}\) is symmetric).
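A small sketch of step 3, assuming NumPy. `np.linalg.eigh` is intended for symmetric matrices and returns eigenvalues in ascending order, so they are reversed here to match the descending convention. The helper name `sorted_eigh` and the 2x2 matrix are made up for illustration.

```python
import numpy as np

def sorted_eigh(C):
    """Eigendecomposition of a symmetric matrix, sorted by decreasing eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(C)   # ascending eigenvalues, eigenvectors as columns
    order = np.argsort(eigvals)[::-1]      # indices for descending order
    return eigvals[order], eigvecs[:, order]

# Hypothetical covariance matrix, chosen only for illustration
C = np.array([[4.0, 1.0],
              [1.0, 3.0]])
lam, V = sorted_eigh(C)

# Each column V[:, k] satisfies C v_k = lam_k v_k
residual = C @ V - V * lam
```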

4. Principal components matrix

\[\mathbf{V}_K = \begin{bmatrix} \mathbf{v}_1 & \mathbf{v}_2 & \cdots & \mathbf{v}_K \end{bmatrix} \in \mathbb{R}^{d \times K}\]

5. Projection (encoding)

\[\mathbf{Z} = \tilde{\mathbf{X}} \mathbf{V}_K \in \mathbb{R}^{n \times K}\]

Each row of \(\mathbf{Z}\) is \(\mathbf{z}_i^\top\), where \(\mathbf{z}_i = \mathbf{V}_K^\top \tilde{\mathbf{x}}_i \in \mathbb{R}^K\) is the low-dimensional representation of point \(i\).

6. Reconstruction (decoding)

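\[\hat{\mathbf{X}} = \mathbf{Z} \mathbf{V}_K^\top + \mathbf{1}\bar{\mathbf{x}}^\top = \tilde{\mathbf{X}} \mathbf{V}_K \mathbf{V}_K^\top + \mathbf{1}\bar{\mathbf{x}}^\top\]

where \(\mathbf{1} \in \mathbb{R}^n\) is the all-ones vector, so the mean is added back to every row.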
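Steps 4 to 6 can be sketched as an encode/decode pair. The function names `pca_fit`, `encode`, and `decode` are hypothetical helpers introduced here, and the random data is illustrative only. NumPy broadcasting adds the mean back to every row.

```python
import numpy as np

def pca_fit(X, K):
    """Return the mean and the top-K principal directions (columns of V_K)."""
    x_bar = X.mean(axis=0)
    X_tilde = X - x_bar
    C = X_tilde.T @ X_tilde / (X.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(C)   # ascending order
    V_K = eigvecs[:, ::-1][:, :K]          # reverse columns, keep the first K
    return x_bar, V_K

def encode(X, x_bar, V_K):
    """Step 5: Z = X_tilde V_K, shape (n, K)."""
    return (X - x_bar) @ V_K

def decode(Z, x_bar, V_K):
    """Step 6: X_hat = Z V_K^T + mean, shape (n, d)."""
    return Z @ V_K.T + x_bar

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))              # illustrative data

x_bar, V_K = pca_fit(X, K=2)
Z = encode(X, x_bar, V_K)
X_hat = decode(Z, x_bar, V_K)

# With K = d nothing is discarded, so the round trip recovers X
x_bar_full, V_full = pca_fit(X, K=4)
X_roundtrip = decode(encode(X, x_bar_full, V_full), x_bar_full, V_full)
```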

7. Reconstruction error

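\[\mathcal{L} = \frac{1}{n-1} \sum_{i=1}^{n} \left\| \mathbf{x}_i - \hat{\mathbf{x}}_i \right\|^2 = \sum_{k=K+1}^{d} \lambda_k\]

With the \(\frac{1}{n-1}\) normalisation used for \(\mathbf{C}\), the sum of the discarded eigenvalues exactly equals the squared reconstruction error averaged over \(n-1\). Dividing by \(n\) instead scales the result by \(\frac{n-1}{n}\), which is negligible for large \(n\).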
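The relationship between reconstruction error and discarded eigenvalues can be checked numerically. The sketch below assumes the \(\frac{1}{n-1}\) covariance normalisation from step 2; under that convention the identity is exact when the squared error is also divided by \(n-1\), and off by a factor \(\frac{n-1}{n}\) when divided by \(n\). The data and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
# Illustrative data with clearly different spread per feature
X = rng.normal(size=(500, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
n, d = X.shape
K = 2

x_bar = X.mean(axis=0)
X_tilde = X - x_bar
C = X_tilde.T @ X_tilde / (n - 1)

eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

V_K = eigvecs[:, :K]
X_hat = X_tilde @ V_K @ V_K.T + x_bar                # encode then decode

total_sq_err = np.sum((X - X_hat) ** 2)
err_n_minus_1 = total_sq_err / (n - 1)               # matches the covariance normalisation
err_n = total_sq_err / n                             # plain mean over points
discarded = eigvals[K:].sum()                        # lambda_{K+1} + ... + lambda_d
```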

8. Variance explained

\[\text{VE}(K) = \frac{\sum_{k=1}^{K} \lambda_k}{\sum_{k=1}^{d} \lambda_k}\]
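As a rough sketch, \(\text{VE}(K)\) for every \(K\) is a cumulative sum of the sorted eigenvalues. The helper name `variance_explained`, the eigenvalues, and the 85% threshold below are made-up values for illustration.

```python
import numpy as np

def variance_explained(eigvals):
    """VE(K) for K = 1..d, given the eigenvalues of the covariance matrix."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # descending
    return np.cumsum(eigvals) / eigvals.sum()

# Hypothetical eigenvalues
ve = variance_explained([4.0, 3.0, 2.0, 1.0])

# Smallest K whose explained variance reaches the threshold
threshold = 0.85
K_min = int(np.searchsorted(ve, threshold) + 1)
```

Picking \(K\) from a threshold like this is one common heuristic, not something the formula itself prescribes.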

9. SVD equivalence

PCA can be computed directly via the SVD of the centered data matrix, avoiding explicit covariance computation:

\[\tilde{\mathbf{X}} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}^\top\]

The principal directions are the right singular vectors \(\mathbf{V}\), and the eigenvalues relate to singular values by:

\[\lambda_k = \frac{\sigma_k^2}{n - 1}\]

The scores are then \(\mathbf{Z} = \mathbf{U}_K \mathbf{\Sigma}_K\).
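The SVD route can be compared against the covariance route in a few lines. This is a sketch on illustrative random data. `np.linalg.svd` returns \(\mathbf{V}^\top\) (called `Vt` here) and the singular values in descending order.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))   # illustrative data
n = X.shape[0]
X_tilde = X - X.mean(axis=0)

# Thin SVD: X_tilde = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
eigvals_svd = s**2 / (n - 1)                 # lambda_k = sigma_k^2 / (n - 1)

# Covariance route, for comparison
C = X_tilde.T @ X_tilde / (n - 1)
eigvals_cov = np.linalg.eigvalsh(C)[::-1]    # descending order

K = 2
Z_svd = U[:, :K] * s[:K]                     # scores as U_K Sigma_K
Z_proj = X_tilde @ Vt[:K].T                  # scores as X_tilde V_K
```

Individual eigenvectors from `eigh` and rows of `Vt` may differ by sign, which is why the comparison above uses eigenvalues and scores computed from the same `Vt`.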

Question:

What does direction of maximum variance mean?

Answer:

Variance is a measure of how spread out numbers are. When we talk about a direction in 2D or 3D space, the variance in that direction is how spread out the data points are when you squash them onto a line pointing that way.

Imagine shining a flashlight on the data cloud from different angles and measuring how long the shadow is. The direction that casts the longest shadow is the direction of maximum variance.

More precisely: for a unit vector \(\mathbf{v}\), the variance of the data projected onto it is \(\mathbf{v}^\top \mathbf{C}\, \mathbf{v}\). PCA finds the \(\mathbf{v}\) that maximises this.

Drag the angle slider and notice two things:

The shadow widens and narrows. The right panel shows the 1D “shadow” of the data onto the chosen line. A wide, spread-out shadow means high variance in that direction. A narrow, squished shadow means low variance.

The bar fills toward green. The fill bar shows what fraction of the maximum possible variance you’re capturing. It hits 100% exactly at the direction PCA would find: the eigenvector of the covariance matrix with the largest eigenvalue.

Formally, the variance along unit vector \(\mathbf{v}\) is:

\[\text{Var}(\mathbf{v}) = \mathbf{v}^\top \mathbf{C}\, \mathbf{v}\]

PCA solves \(\max_{\|\mathbf{v}\|=1} \mathbf{v}^\top \mathbf{C}\, \mathbf{v}\), and the solution is the eigenvector with the largest eigenvalue. That eigenvalue \(\lambda_1\) is exactly the maximum variance — which is why eigenvalues appear in the variance-explained formula.
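The "try every angle" idea can be imitated numerically in 2D: sweep unit vectors over a half circle, compute \(\mathbf{v}^\top \mathbf{C}\, \mathbf{v}\) for each, and compare the best one with the top eigenvector. This is a brute-force sketch on made-up data, meant only to illustrate the claim, not how PCA is computed in practice.

```python
import numpy as np

rng = np.random.default_rng(4)
# Illustrative 2D cloud, stretched and sheared
X = rng.normal(size=(1000, 2)) @ np.array([[2.0, 1.0],
                                           [0.0, 0.5]])
X_tilde = X - X.mean(axis=0)
C = np.cov(X_tilde, rowvar=False)

# Unit vectors at 0.1 degree steps over [0, 180] degrees
angles = np.linspace(0.0, np.pi, 1801)
dirs = np.column_stack([np.cos(angles), np.sin(angles)])

# v^T C v for every direction at once
variances = np.einsum("ij,jk,ik->i", dirs, C, dirs)
best_dir = dirs[np.argmax(variances)]

# Eigenvector route: eigh sorts ascending, so the last pair is the largest
eigvals, eigvecs = np.linalg.eigh(C)
lam1, v1 = eigvals[-1], eigvecs[:, -1]

# |cos| of the angle between the brute-force direction and v1 (sign is arbitrary)
alignment = abs(best_dir @ v1)
```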

Question:

What is the formula for the variance of the data projected onto an arbitrary vector?

Answer:

For a unit vector \(\mathbf{v} \in \mathbb{R}^d\) with \(\|\mathbf{v}\| = 1\), the variance of the centered data projected onto it is:

\[\text{Var}(\mathbf{v}) = \mathbf{v}^\top \mathbf{C}\, \mathbf{v}\]

where \(\mathbf{C} = \frac{1}{n-1}\tilde{\mathbf{X}}^\top \tilde{\mathbf{X}}\) is the covariance matrix.

Written out from first principles, the scalar projection of each point \(\tilde{\mathbf{x}}_i\) onto \(\mathbf{v}\) is \(z_i = \mathbf{v}^\top \tilde{\mathbf{x}}_i\), and since the data is already centered the mean projection is zero, so:

\[\text{Var}(\mathbf{v}) = \frac{1}{n-1} \sum_{i=1}^{n} z_i^2 = \frac{1}{n-1} \sum_{i=1}^{n} (\mathbf{v}^\top \tilde{\mathbf{x}}_i)^2 = \frac{1}{n-1} \sum_{i=1}^{n} \mathbf{v}^\top \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^\top \mathbf{v} = \mathbf{v}^\top \!\left(\frac{1}{n-1}\sum_{i=1}^n \tilde{\mathbf{x}}_i \tilde{\mathbf{x}}_i^\top\right) \mathbf{v} = \mathbf{v}^\top \mathbf{C}\, \mathbf{v}\]
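The chain of equalities above can be sanity-checked numerically. The data and the direction `v` below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 3))            # illustrative data
n = X.shape[0]
X_tilde = X - X.mean(axis=0)
C = X_tilde.T @ X_tilde / (n - 1)

v = np.array([1.0, 2.0, -1.0])           # arbitrary direction
v = v / np.linalg.norm(v)                # normalise to unit length

z = X_tilde @ v                          # scalar projections z_i = v^T x_tilde_i
var_direct = np.sum(z**2) / (n - 1)      # left-hand side of the derivation
var_quadratic = v @ C @ v                # right-hand side: v^T C v
```

Because the projections of centered data have zero mean, `var_direct` should also match `np.var(z, ddof=1)`.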

If \(\mathbf{v}\) is not a unit vector, the normalised version is:

\[\text{Var}(\mathbf{v}) = \frac{\mathbf{v}^\top \mathbf{C}\, \mathbf{v}}{\mathbf{v}^\top \mathbf{v}}\]

This is the Rayleigh quotient of \(\mathbf{C}\). Its maximum over all nonzero \(\mathbf{v}\) is \(\lambda_1\) (the largest eigenvalue), achieved when \(\mathbf{v}\) is any nonzero multiple of \(\mathbf{v}_1\) (the corresponding eigenvector), which is exactly the direction PCA finds.
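A short sketch of the Rayleigh quotient, showing that it does not depend on the length of \(\mathbf{v}\) and that random directions never exceed \(\lambda_1\). The function name `rayleigh_quotient` and the 2x2 matrix are assumptions made for illustration.

```python
import numpy as np

def rayleigh_quotient(C, v):
    """(v^T C v) / (v^T v) for a nonzero vector v."""
    v = np.asarray(v, dtype=float)
    return (v @ C @ v) / (v @ v)

# Hypothetical covariance matrix
C = np.array([[4.0, 1.0],
              [1.0, 3.0]])
eigvals, eigvecs = np.linalg.eigh(C)     # ascending order
lam1, v1 = eigvals[-1], eigvecs[:, -1]

at_v1 = rayleigh_quotient(C, v1)
at_scaled_v1 = rayleigh_quotient(C, 5.0 * v1)   # rescaling leaves the quotient unchanged

# Quotients for many random (non-unit) vectors
rng = np.random.default_rng(6)
samples = rng.normal(size=(1000, 2))
quotients = np.array([rayleigh_quotient(C, v) for v in samples])
```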